INTRODUCTION
This pipeline walkthrough is for fastq ITS2 data, and is coded in R.
Included within this document are six sections as seen above this text in the navigation bar.
The Walkthrough tab (this one!) provides a step-by-step walkthrough of the pipeline, including common errors. All lines of code are numbered to easily find where you are in the pipeline.
The Flowchart tab is a brief overview of the pipeline, including what outputs are given. The flowchart matches the navigation bar on the left, so it’s easy to find steps you’re looking for :)
The Variables to be Changed tab lists every line of code in the pipeline that may need to be changed based on your data or file directories.
The Troubleshooting tab contains common errors that may arise.
The Outputs tab provides a list of outputs from this pipeline, as well as descriptions as to what information is within them.
The Server tab is a guide for moving onto the server (pipeline and files), as well as moving files off the server. If you are just beginning to use ARC, instructions and available files that will allow the pipeline to run smoothly are here!
The Considerations tab is a list of things to consider when analyzing data.
The Graphing tab includes how to create a basic interactive barplot that allows for the easy reviewing of data.
To the left of every page is a navigation bar: this allows you to quickly find any part of the pipeline/information you need.
Any lines of code with a ‘#’ in front of them will not run.
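As a quick demonstration of how comments behave, the toy snippet below (illustration only, not part of the pipeline) shows that a ‘#’ silences everything after it:

```r
x <- 5        # everything after the '#' on this line is a comment and is ignored
# x <- 10     # this entire line is commented out, so it never runs
print(x)      # [1] 5
```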
Recommended Workflow
If you want a walkthrough of the entire pipeline and how it works:
If you are familiar with running pipelines:
Use the Flowchart for a brief overview of the types of outputs/nature of the pipeline
Use Variables to be Changed to quickly find what you need to change to run the pipeline
Check the Considerations for running the pipeline
Consider Graphing the results for a visual analysis of the output
If you are not familiar with running pipelines:
Files Needed
Ensure you know where all the following files are located on your computer (file directories):
Samples - these will be fastq.gz files, often named like this: 1_1S_L001_R1_001.fastq.gz for the forward reads and 1_1S_L001_R2_001.fastq.gz for the reverse reads. Unfortunately, as of now, samples that do not make it through the filtering steps of the pipeline will need to be removed manually (more on this in the filtering sections of the walkthrough).
Database - find the Nematode ITS2 database for dada2 here: https://www.nemabiome.ca/its2-database.html. You will find it under the Analysis tab, called ITS2 Database. The dada2 format works for both IdTaxa and AssignTaxonomy, the two classification algorithms in this pipeline that use the database.
R file - this is the pipeline. It will be called FinalPipelineSERVER.R for the server version, and FinalPipelineLOCAL.R for a local version.
Packages
These are packages this pipeline needs. They contain various functions used within the code.
If there is a problem with loading packages, check the Troubleshooting tab.
5 library(DECIPHER)
6 library(dada2)
7 library(ShortRead)
8 library(Biostrings)
9 library(ggplot2)
10 library(phyloseq)
11 library(dplyr)
12 library(reshape2)
13 library(GenomeInfoDbData)
14 library(openxlsx)
15 library(BiocManager)
16 library(stringr)
Setting Path to Samples
Here, you are telling R where your samples are located. This directory will need to be changed depending on where your samples are on the server/locally. The Variables to be Changed tab contains all directory changes in a more concise manner.
23 path <- "/home/jill.derijke/Test10Gdeer" # server
The “pattern” below is the end of your sample file names - this pattern must match your samples! With this pattern forward and reverse reads are distinguished (in this case R1 = forward reads and R2 = reverse reads).
25 fnFs <-sort(list.files(path, pattern = "_R1_001.fastq.gz", full.names = TRUE))
26 fnRs <-sort(list.files(path, pattern = "_R2_001.fastq.gz", full.names = TRUE))
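The pattern matching above can be tried on a toy example (illustration only: a temporary directory with empty files mimicking the sample naming convention):

```r
# create a throwaway directory with two dummy "sample" files
d <- tempfile()
dir.create(d)
file.create(file.path(d, c("1_1S_L001_R1_001.fastq.gz",
                           "1_1S_L001_R2_001.fastq.gz")))
# only the forward-read file matches the R1 pattern
sort(list.files(d, pattern = "_R1_001.fastq.gz", full.names = FALSE))
```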
Locating Primers + Pre-Filtering
The lines below will need to be changed if the primers used for your data are different.
33 FWD <-"ACGTCTGGTTCAGGGTTGTT" # NC1 primer
34 REV <- "TTAGTTTCTTTTCCTCCGCT" # NC2 primer
Now that the primer sequences are identified, the code below allows these primers to be found in all orientations.
36 allOrients<-function(primer) {
37 require(Biostrings)
38 dna<-DNAString(primer)
39 orients<-c(Forward=dna, Complement=complement(dna), Reverse=reverse(dna),
40 RevComp=reverseComplement(dna))
41 return(sapply(orients, toString)) #now converting back into character vector
42 }
43 FWD.orients <- allOrients(FWD)
44 REV.orients <- allOrients(REV)
If run locally, running the below lines will show the primer sequence in all orientations.
If run on the server, this can be seen by viewing the “test_0000000.out” output file.
46 FWD.orients # to view output
47 REV.orients
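To build intuition for what allOrients returns, here is a base-R sketch of the four orientations using a toy primer "ACGT" (not the real NC1/NC2 primers; the pipeline itself uses Biostrings for this):

```r
# base-R helpers mimicking reverse() and complement() for a plain string
revstr  <- function(s) paste(rev(strsplit(s, "")[[1]]), collapse = "")
complmt <- function(s) chartr("ACGT", "TGCA", s)

primer <- "ACGT"  # toy primer for illustration
c(Forward    = primer,
  Complement = complmt(primer),             # "TGCA"
  Reverse    = revstr(primer),              # "TGCA"
  RevComp    = revstr(complmt(primer)))     # "ACGT" (this primer is palindromic)
```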
Lines 50 and 51 create file paths for all filtered reads, which will be output to a folder called “filtN” inside the directory your samples are in.
50 fnFs.filtN <- file.path(path, "filtN", basename(fnFs))
51 fnRs.filtN <- file.path(path, "filtN", basename(fnRs))
The line below is the pre-filtering step: it removes Ns from the sequences to allow for the removal of primers.
52 filterAndTrim(fnFs, fnFs.filtN, fnRs, fnRs.filtN, maxN = 0, multithread = 12) # pre-filtering to remove Ns
Now, the number of primers in all orientations will be counted with the code below.
#counting number of primers
primerHits <- function(primer, fn) {
# Counts number of reads in which the primer is found
nhits <- vcountPattern(primer, sread(readFastq(fn)), fixed = FALSE)
return(sum(nhits > 0))
}
primer_counts <- rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.filtN[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.filtN[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.filtN[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.filtN[[1]]))
The line below saves a count of the number of primers present in your samples as an rds file.
saveRDS(primer_counts, "primer_countsBEFORE.rds")
Cutadapt - Primer Removal
The following block of code will remove primers from your reads.
You will need to change the directory for this external program to wherever it is on the server/computer.
# Load cutadapt
cutadapt <- "/home/jill.derijke/miniconda3/bin/cutadapt" # server // change to the cutadapt path on your machine
system2(cutadapt, args = "--version") # Run shell commands from R
The lines below create a folder for all cutadapted samples, again inside the directory your samples are in.
#now creating output filenames for cutadapted files, and define parameters to give cutadapt command
#critical parameters are the primers and they need to be in the right orientation
path.cut <- file.path(path, "cutadapt")
if(!dir.exists(path.cut)) dir.create(path.cut)
fnFs.cut <- file.path(path.cut, basename(fnFs))
fnRs.cut <- file.path(path.cut, basename(fnRs))
FWD.RC <- dada2:::rc(FWD) # these lines allow for the reverse-complement of the primer to be found
REV.RC <- dada2:::rc(REV)
# Trim FWD and the reverse-complement of REV off of R1 (forward reads)
R1.flags <- paste("-g", FWD, "-a", REV.RC)
# Trim REV and the reverse-complement of FWD off of R2 (reverse reads)
R2.flags <- paste("-G", REV, "-A", FWD.RC)
The following code contains the actual cutadapt command, which will remove primers from the reads. Parameters should not be changed here, as cutadapt is a little tricky to get working and its command-line parameters are harder to understand. However, the cutadapt documentation is linked in the Links tab if you want to better understand the tool.
# Run Cutadapt
for(i in seq_along(fnFs)) {
system2(cutadapt, args = c(R1.flags, R2.flags, "-n", 2, # -n 2 required to remove FWD and REV from reads
"-o", fnFs.cut[i], "-p", fnRs.cut[i], # output files
fnFs.filtN[i], fnRs.filtN[i])) # input files
}
After running cutadapt, primers should be removed! This can and should be checked with the lines below, which search for primers in your reads. This information is also saved as an rds file that can be imported back into R locally to view.
# remember to check things - can check presence of primers in the first cutadapted sample
primercut_counts <- rbind(FWD.ForwardReads = sapply(FWD.orients, primerHits, fn = fnFs.cut[[1]]),
FWD.ReverseReads = sapply(FWD.orients, primerHits, fn = fnRs.cut[[1]]),
REV.ForwardReads = sapply(REV.orients, primerHits, fn = fnFs.cut[[1]]),
REV.ReverseReads = sapply(REV.orients, primerHits, fn = fnRs.cut[[1]]))
saveRDS(primercut_counts, "primer_countsAFTER.rds")
The lines below sort the forward and reverse cutadapted reads respectively, keeping them separate as they move through the pipeline. If file names were changed earlier in the pipeline (see Setting Path to Samples), ensure they match the file names defined here!
# Forward and reverse fastq filenames have the format:
cutFs <- sort(list.files(path.cut, pattern = "_R1_001.fastq.gz", full.names = TRUE)) #remember to match file names
cutRs <- sort(list.files(path.cut, pattern = "_R2_001.fastq.gz", full.names = TRUE))
Main Filtering
#filtering reads
filtFs <- file.path(path.cut, "filtered", basename(cutFs))
filtRs <- file.path(path.cut, "filtered", basename(cutRs))
All of the parameters are default except maxEE, which is recommended to be more relaxed for reverse reads. A full explanation of every parameter found below can be found in the documentation, located within the Links tab. Briefly:
maxN removes reads with more Ns than maxN specifies. The pipeline does not allow Ns within the reads, so this is set to 0.
maxEE specifies the maximum number of expected errors allowed in the forward and reverse reads respectively - here, 2 expected errors are allowed in forward reads and 5 in reverse reads. The reverse read limit has been relaxed due to the inherently worse quality of reverse reads.
truncQ truncates each read at the first position where the quality score drops below the specified value.
minLen removes reads shorter than the specified length (in this case 50), after the other filtering steps are completed.
rm.phix removes reads that match the phiX genome.
There are many more parameters, which can be found by viewing the documentation.
out <- filterAndTrim(cutFs, filtFs, cutRs, filtRs, maxN = 0, maxEE = c(2, 5),
truncQ = 2, minLen = 50, rm.phix = TRUE, compress = TRUE, multithread = 12)
The number of reads going into the filtering step and the number coming out, per sample, can be seen in the final summary under the TrackReads tab. Depending on your samples, this is a useful indicator of the quality of your reads.
The lines below list the names of your samples to be used later on. The sample name is obtained by splitting the full file name: for example, the sample file called 1_1S_L001_R1_001.fastq.gz will be split at the specified pattern “001_R” to become 1_1S_L.
# Extract sample names, assuming filenames have same format, may need to alter!!!
get.sample.name <- function(fname) strsplit(basename(fname), "001_R")[[1]][1]
sample.names <- unname(sapply(filtFs, get.sample.name))
sample.names
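The splitting behavior above can be checked locally on a single file name (toy demonstration only):

```r
# split the file name at "001_R" and keep the first piece
strsplit("1_1S_L001_R1_001.fastq.gz", "001_R")[[1]][1]
# "1_1S_L"
```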
Below is the learning errors phase of the pipeline, where it learns the error rates of your samples. See the documentation for the LearnErrors function in the Links tab for more information!
#learning errors
errF <- learnErrors(filtFs, multithread = 12)
errR <- learnErrors(filtRs, multithread = 12)
The lines below save the results from learning errors to an rds file, which can be imported back into R and viewed locally. In the RDS_Graphing_FinalPipeline.R file, there will be code to enable this information to be seen as a plot.
saveRDS(errF, "errF.rds")
saveRDS(errR, "errR.rds")
These lines of code dereplicate the data, collapsing identical reads into unique sequences (with their abundances) so there are no replicate reads.
# dereplicating data
derepFs <- derepFastq(filtFs, verbose = TRUE)
derepRs <- derepFastq(filtRs, verbose = TRUE)
Below, sample inference occurs. The paper explaining the algorithm can be found in the Links tab. To my understanding, this resolves the reads into sequence variants.
#sample inference - at this step the core sample inference algorithm is applied to the dereplicated data
dadaFs <- dada(derepFs, err = errF, multithread = 12)
dadaRs <- dada(derepRs, err = errR, multithread = 12)
Merge and Concatenate Reads
With ITS2 data, there is a possibility of the forward and reverse read not overlapping. With the normal dada2 pipeline, there is one function that merges OR concatenates forward and reverse reads, discarding those that do not overlap sufficiently or concatenating them all with 10 Ns between them. The code below combines these processes in a sequential manner so all reads that did not merge sufficiently are concatenated instead. A link to the source code and a little explanation from the writer is included in the Links section. A diagram is included below for added clarity.
The code below merges all reads 3 times:
1. mergedData: Using the most stringent merging parameter with 0 mismatches allowed in the overlap
2. mismatch_mergers: Using a relaxed merging parameter with 4 mismatches allowed in the overlap
3. concats: Concatenating the reads - merging them with 10 Ns
# maxMismatch=0 uses a higher penalty for mismatches and gaps in nwalign (-64)
cat("\n\nperfect match merging....\n")
mergedData <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, maxMismatch=0, returnRejects=TRUE, verbose=TRUE)
# maxMismatch > 0 uses a lower penalty for mismatches and gaps in nwalign (-8)
cat("\n\nmismatch allowed merging....\n")
mismatch_mergers <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, maxMismatch=4, returnRejects=TRUE, verbose=TRUE)
cat("\n\nconcatenation merging....\n")
concats <- mergePairs(dadaFs, derepFs, dadaRs, derepRs, justConcatenate=TRUE, verbose=TRUE)
The code below is not run (each line is ‘deactivated’ by the # in front of it). The reason is that it creates a file for each sample of all reads that were loosely merged, strictly merged, and concatenated - 3 files for every sample, which becomes a lot of files, turning 10 sample files into 10 + (3 x 10). It is advised to leave these lines commented out unless this output is needed/wanted, or when running only a few samples.
# # save sample merge dfs for diff comparisons
# sapply(names(mergedData), function (x) write.table(mergedData[[x]], file=paste(x, "strict_mergers.txt", sep="_"),
# sep="\t", row.names=FALSE, quote=FALSE) )
#
# sapply(names(mismatch_mergers), function (x) write.table(mismatch_mergers[[x]], file = paste(x, "loose_mergers.txt", sep="_"),
# sep="\t", row.names=FALSE, quote=FALSE) )
#
# # converting concat df to matrix to overcome error of - unimplemented type 'list' in 'EncodeElement'
# sapply(names(concats), function (x) write.table(as.matrix(concats[[x]]), file=paste(x, "all_concats.txt", sep="_"),
# sep="\t", row.names=FALSE, quote=FALSE) )
The code below keeps only samples that have at least one row (i.e., at least one read pair). These lines were not added by me, but they are needed for the rest of the merge and concatenate process to run smoothly.
# TEST --> If the row has 0 rows, then remove it from list?
mergedData_test <- mergedData[sapply(mergedData, function(x) dim(x)[1]) > 0]
mismatch_mergers_test <- mismatch_mergers[sapply(mismatch_mergers, function(x) dim(x)[1]) > 0]
concats_test <- concats[sapply(concats, function(x) dim(x)[1]) > 0]
In the lines below, the code creates a “rowsToDelete” vector so that ASVs failing the merging criteria below can be dropped later (as we only want to keep the best merge of each forward/reverse read pair).
# replace mismatched or concatenated ASVs in the main mergedData
for(i in names(mergedData_test)) {
# store row index to drop certain ASVs later
rowsToDelete = vector()
The lines below create dataframes of the 3 versions of merged/concatenated reads, creating mergedDf, mismatchDf, and concatDf.
mergedDf = mergedData_test[[i]]
cat(i, "Out of total", sum(mergedDf$abundance), "paired-reads (in", nrow(mergedDf), "unique pairings), retained ")
mismatchDf = mismatch_mergers_test[[i]]
rownames(mismatchDf) = paste(mismatchDf$forward, mismatchDf$reverse, sep = "_")
concatDf = concats_test[[i]]
rownames(concatDf) = paste(concatDf$forward, concatDf$reverse, sep = "_")
Below, any rows in mergedDf (the strictly merged reads) that were accepted as good merges are skipped and kept as-is.
for (row in 1:nrow(mergedDf)) {
# skipping rows that are good to go from default analysis
if (mergedDf[row,]$accept) { next }
uniquePairID = paste(mergedDf[row,]$forward, mergedDf[row,]$reverse, sep="_")
If the match (overlap) length for a read pair is 12 bp or less, the overlap is not sufficient for merging, and that read pair is replaced with its concatenated version from concatDf (which contains only concatenated reads).
# if match length is 12 or less, go with concatenation
if (mismatchDf[uniquePairID,]$nmatch <= 12) {
mergedDf[row,] = concatDf[uniquePairID,]
}
ASVs with a long overlap and many mismatches are discarded, however. This reduces bias, because a longer overlap between forward and reverse reads naturally increases the probability of mismatches within the overlap. The code does this in tiers (“ranks”), so the same number of mismatches is not allowed regardless of overlap length.
# ignoring ASVs with many mismatches with long overlap
else{
misMatchIndels = mismatchDf[uniquePairID,]$nmismatch + mismatchDf[uniquePairID,]$nindel
cutOff = 0
The line below enforces a cutoff of 1 mismatch for overlaps of more than 12 and up to 50 bp.
if ( mismatchDf[uniquePairID,]$nmatch > 12 && mismatchDf[uniquePairID,]$nmatch <= 50 ) {
cutOff = 1
}
The line below enforces a cutoff of 2 mismatches for overlaps of more than 50 and up to 100 bp.
else if ( mismatchDf[uniquePairID,]$nmatch > 50 && mismatchDf[uniquePairID,]$nmatch <= 100 ) {
cutOff = 2
}
The line below enforces a cutoff of 3 mismatches in an overlap greater than 100 bp.
else if (mismatchDf[uniquePairID,]$nmatch > 100) {
cutOff = 3
}
# check if mismatches are below cut off
if (misMatchIndels <= cutOff) {
mergedDf[row,] = mismatchDf[uniquePairID,]
}
# discard if mismatches are high in longer than 100bp match
else if (cutOff==3 && misMatchIndels > 3) {
rowsToDelete = c(rowsToDelete, row)
}
The code below concatenates reads when the number of mismatches in the overlap region is too high. However, the overlap sequence is present at the end of the forward read and again at the beginning of the reverse read (e.g. if the forward read ends in ATATATATAT, the reverse read begins with that same sequence if they overlap). The overlap region of the reverse read is therefore trimmed so the concatenated sequence does not contain the overlap twice.
# concatenate if mismatches are high in overlap region remove reverse read part of the overlap region
else {
trimLength = mismatchDf[uniquePairID,]$nmatch
trimLength = mismatchDf[uniquePairID,]$nmatch
concatDf[uniquePairID,]$sequence = gsub(paste0("N{10}[A-Za-z]{", trimLength, "}"), "NNNNNNNNNN", concatDf[uniquePairID,]$sequence)
mergedDf[row,] = concatDf[uniquePairID,]
}
}
}
The lines below drop the rows recorded in the “rowsToDelete” vector from the merged data (if there are any).
if (length(rowsToDelete) == 0){
mergedData_test[[i]] = mergedDf
} else {
mergedData_test[[i]] = mergedDf[-rowsToDelete,]
}
cat(sum(mergedData_test[[i]]$abundance), "paired-reads (in", nrow(mergedData_test[[i]]), "unique pairings)\n")
}
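The tiered cutoffs applied in the loop above can be summarized as a small standalone helper (an illustrative sketch only - the pipeline applies the same thresholds inline rather than via this hypothetical function):

```r
# mismatch/indel cutoff as a function of overlap length (nmatch),
# mirroring the tiers used in the merging loop above
mismatchCutoff <- function(nmatch) {
  if (nmatch <= 12) return(NA)   # too short an overlap: concatenate instead
  if (nmatch <= 50)  return(1)   # 13-50 bp overlap: allow 1 mismatch/indel
  if (nmatch <= 100) return(2)   # 51-100 bp overlap: allow 2
  3                              # >100 bp overlap: allow 3
}
mismatchCutoff(40)   # 1
mismatchCutoff(120)  # 3
```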
At this point, reads are merged and/or concatenated to the best of their ability, with very few reads lost! The number of reads lost at this step can be viewed in the TrackReads tab of the final summary - not many should be lost. The lines below save lists of the results from each merging parameter as rds files, to be imported back into R and viewed locally. For each sample, these list the ASV sequence, its abundance, the number of mismatches, and whether or not the ASV was accepted as a good merge under the respective parameters, as well as other information.
saveRDS(mergedData_test, "mergedData_test.rds")
saveRDS(mismatch_mergers_test, "mismatch_mergers_test.rds")
saveRDS(concats_test, "concats_test.rds")
Sequence Table
With the ASVs ready, the lines below create an Amplicon Sequence Variant table and attempt to remove chimeras.
#can now construct a sequence table - Amplicon Sequence Variant Table (ASV), a higher-resolution version of the OTU table
seqtab <- makeSequenceTable(mergedData_test)
#Remove chimeras
seqtab.nochim <- removeBimeraDenovo(seqtab, method="consensus", multithread=12, verbose=TRUE)
The distribution of lengths of the ASVs can be viewed within the test_000000.out file on the server.
# inspect distribution of sequence lengths
table(nchar(getSequences(seqtab.nochim)))
Tracking Reads
It is important to view the output of this step - it tracks the reads through the pipeline and tells you how many reads were lost at each point, including all filtering stages. Depending on your data, if too many reads are being lost at the filtering stage, parameters may need to be loosened. The information from these lines is in the final summary excel file of the pipeline!
#track reads through the pipeline - inspect the number of reads that made it through each step in the pipeline -- IMPORTANT
getN <- function(x) sum(getUniques(x))
track <- cbind(out, sapply(dadaFs, getN), sapply(dadaRs, getN), sapply(mergedData_test, getN), rowSums(seqtab.nochim))
# If processing a single sample, remove the sapply calls: e.g. replace
# sapply(dadaFs, getN) with getN(dadaFs)
colnames(track) <- c("input", "filtered", "denoisedF", "denoisedR", "merged",
"nonchim")
rownames(track) <- sample.names
#TOTAL READS per step
colSums(track)
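The structure of the tracking table can be illustrated with a toy matrix (made-up read counts, fewer columns than the real pipeline):

```r
# toy tracking table: reads per sample at each stage (illustrative numbers only)
track_demo <- cbind(input    = c(1000, 800),
                    filtered = c(900, 700),
                    merged   = c(850, 660),
                    nonchim  = c(840, 650))
rownames(track_demo) <- c("sample1", "sample2")
colSums(track_demo)  # total reads across all samples at each stage
```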
Classification: IdTaxa
IdTaxa is a relatively new classification algorithm, which classifies the ASVs from the pipeline to the highest resolution it can using a database. A link to the documentation of IdTaxa is included within the Links tab.
The lines below do not run, as denoted by the ‘#’ in front of the line. This line allows you to ‘set a seed’ (any number inside the brackets) to ensure reproducibility when repeating a run. Occasionally there are differences in classification resolution between runs, which setting a seed supposedly helps with. I am unsure if it works on the server, but I have left the code in just in case it is worth looking into further.
#set seed when doing repeats - will keep result the same with repeats
# set.seed(123) # seed is a random number
The line below allows R to find the database to be used to classify sequences and reads it. The directory will need to be changed based on where your database file is located.
fasta <- "/home/jill.derijke/dada2.fasta" # server
database <- readDNAStringSet(fasta)
The block of code below takes the headers from the database reference sequences to create a taxonomy file that IdTaxa will use to ‘learn’ the database. More information on these processes can be found in the documentation within the Links tab.
###### IDTAXA #####
#parse headers to obtain taxonomy
s <- strsplit(names(database), ";")
domain <-sapply(s, `[`, 1)
kingdom <-sapply(s, `[`, 2)
phylum <- sapply(s, `[`, 3)
class <- sapply(s, `[`, 4)
order <- sapply(s, `[`, 5)
family <- sapply(s, `[`, 6)
genus <- sapply(s, `[`, 7)
species <-sapply(s, `[`, 8)
taxonomy <- paste("Root", domain, kingdom, phylum, class, order, family, genus, species, sep=";")
taxonomat<- as.matrix(taxonomy)
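As an illustration of the header parsing above, here is one hypothetical semicolon-delimited header (made up for this example, not taken from the real ITS2 database - real headers come from names(database)):

```r
# hypothetical database header with eight semicolon-separated ranks
hdr <- "Eukaryota;Metazoa;Nematoda;Chromadorea;Rhabditida;Trichostrongylidae;Haemonchus;Haemonchus contortus"
s <- strsplit(hdr, ";")[[1]]
s[7]                           # the 7th field is the genus: "Haemonchus"
paste("Root", hdr, sep = ";")  # taxonomy string with "Root" prepended
```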
In the lines below, the LearnTaxa function will train IdTaxa on the information within the database.
#training the classifier
trainingSet<-LearnTaxa(database, taxonomy)
An rds file saves the training set to be visualized locally.
saveRDS(trainingSet, "trainingSet.rds")
The lines below create an object called ‘test’. The name is arbitrary, but it contains the ASVs in a format that IdTaxa can read for classification.
#creating dna string set from the ASVs (from dada2 tutorial) to use in IDTAXA as "test" (converting it to stringset object with same format)
test <- DNAStringSet(colnames(seqtab.nochim))
The line below gives each ASV in the ‘test’ object a sequential name, so classifications can later be traced back to the corresponding ASV.
names(test) <- as.character(1:ncol(seqtab.nochim))
Below is the first classification part of the pipeline. IdTaxa classifies the ASVs based on the database provided to it earlier. Parameters can be changed here, the most important one being threshold=60, which is the default. Depending on your data, it may leave many ASVs classified only to the genus level; however, it is not advised to loosen this parameter, at the risk of sacrificing accuracy for resolution. Again, learn more about IdTaxa through the documentation in the Links section.
##### PARAMETERS TO CHANGE ##### (line below)
# classifying
ids <- IdTaxa(test, trainingSet, threshold = 60, bootstraps = 100, strand = "both", processors = NULL, verbose = TRUE, type="extended")
The line below provides the number of ASVs, which can be a nice double check at the end of the pipeline that none have mysteriously gone missing.
length(ids) # - gives you number of ASVs
The “ids” object holds all the confidence results for the ASVs. This information can also be found in the final excel sheet summary under IdTaxaConfidence, and it is also saved as an rds file to be imported into R locally to view.
saveRDS(ids, "ids.rds")
The lines below attach rank labels (domain, kingdom, etc.) to each classification's confidence values.
idsFunstuff <- function(i){
ids[[i]][['rank']]<-c("rootrank", "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
return(ids[i])
}
idsdone <- sapply(1:length(ids), idsFunstuff)
IdTaxa Classification Table
The dada2 pipeline uses specific matrices after classification that summarize the information very well, including sample names and read abundances, whereas the IdTaxa classification method only outputs the “ids” object mentioned above. Hence, the IdTaxa output needs to be converted into a matrix better suited to the pipeline, as is done with the following code:
# Needed to convert back to analogous taxa matrix from dada2 pipeline tutorial
Species <- sapply(idsdone,
function(x) {
w <- which(x$rank=="species")
if (length(w) != 1) {
"unknown"
} else {
x$taxon[w]
}
})
table(Species)
Genus <- sapply(idsdone,
function(x) {
w <- which(x$rank=="genus")
if (length(w) != 1) {
"unknown"
} else {
x$taxon[w]
}
})
table(Genus)
taxon <- sapply(ids,
function(x)
x$taxon[length(x$taxon)])
# Converting the output object of class "Taxa" to a matrix analogous to the output from assignTaxonomy
ranks<- c( "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
taxid <- t(sapply(idsdone, function(y) {
m <- match(ranks, y$rank)
taxa<- y$taxon[m]
taxa
}))
colnames(taxid) <- ranks
rownames(taxid) <- getSequences(seqtab.nochim)
taxa<- taxid
taxa.print <- taxa # Removing sequence rownames for display only
rownames(taxa.print)<-NULL
The lines below create a dataframe containing only the sample names derived from the file names, to be used further down the pipeline. IMPORTANT: if there are duplicate names, R will merge them - ensure they are unique!
## constructing df of sample data (not using any variables other than x kind of sample)
## if there are duplicate names, R will merge them so make sure they're unique
samplenam<- rownames(seqtab.nochim)
The first line below can be altered based on your data, but ensure the sample name is obtained at a point that guarantees no two sample names are the same. For example, if I split the file names 193_S193_L001_R1_001.fastq.gz and 193_S342_L001_R1_001.fastq.gz at “_S”, I end up with the sample name 193 for both, even though the samples are different. The information for these files will merge, and will break the rest of the pipeline. As of now, the sample name is split at “001_R”, which typically works for the fastq files inputted. The other two lines create and format the sample name dataframe properly.
nameSample <-sapply(strsplit(samplenam, "001_R"), `[`, 1)
dfagain<-data.frame(Samples=nameSample)
rownames(dfagain)<-samplenam
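The name-collision pitfall described above can be demonstrated on the two example file names (toy demonstration, safe to run locally):

```r
files <- c("193_S193_L001_R1_001.fastq.gz", "193_S342_L001_R1_001.fastq.gz")
sapply(strsplit(files, "_S"), `[`, 1)     # both become "193" - a collision!
sapply(strsplit(files, "001_R"), `[`, 1)  # "193_S193_L" and "193_S342_L" - unique
```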
The phyloseq package is an R package designed for analyzing microbiome data. It is used here to format the data in a more user-friendly and clear way; more information is available in the Links tab. Once properly organized, the data is converted into a dataframe for easier use/manipulation.
phyABC<-phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), sample_data(dfagain), tax_table(taxa))
physp <- tax_glom(phyABC, "species") # collapses them properly
classif <- psmelt(physp) #converts into dataframe
colnames(classif)[1]<- c("ASV")
classif$ASV <- NULL
classif$Sample <- NULL
### ORGANIZING DATA INTO TABLE
AllInfoClassif <- tax_table(phyABC)@.Data
seqTable <- as.data.frame(t(seqtab.nochim))
SeqTabAndClassifiedInfo <- cbind(AllInfoClassif, seqTable)
Below, a column is added listing the ASVs.
## add column with ASV #
SeqTabAndClassifiedInfo$ASV <- c("ASV")
SeqTabAndClassifiedInfo$ASV <- paste0("ASV_", seq.int(nrow(SeqTabAndClassifiedInfo)))
The line below adds another column where the ASV and classification are jointly listed. This is for the purpose of graphing and to ensure the ASV and classification matching did not somehow get messed up further down the pipeline.
## getting header to include ASV and species name - adding column with both
SeqTabAndClassifiedInfo$ASVnumber <- paste(SeqTabAndClassifiedInfo$species, SeqTabAndClassifiedInfo$ASV, sep = "-")
Classification: AssignTaxonomy
This is the second classification algorithm that will classify ASVs within this pipeline. More information on AssignTaxonomy can be found in the Links tab. Briefly, AssignTaxonomy is the classifier that originally came with the dada2 pipeline, and testing and comparison against IdTaxa yielded several observations: AssignTaxonomy is less conservative than IdTaxa and sometimes makes assumptions, often sacrificing accuracy for resolution; it provides less information about its classifications; and there is less wiggle room in the parameters that can be altered. It has been left as part of the pipeline as it can be useful as a second check on top of BLAST.
ASSIGNTAXAtaxa <- assignTaxonomy(seqtab.nochim, fasta, multithread = 12, tryRC = TRUE, taxLevels = c("domain","kingdom", "phylum","class", "order","family", "genus", "species"))
The phyloseq package is once again used for data formatting.
phyAssignTaxa<-phyloseq(otu_table(seqtab.nochim, taxa_are_rows = FALSE), sample_data(dfagain), tax_table(ASSIGNTAXAtaxa))
AssignTaxAllInfo <- tax_table(phyAssignTaxa)@.Data
The lines below summarize how many reads were classified to the highest resolution (species) by each classification algorithm.
AssignedToSpAT <- sample_sums(tax_glom(phyAssignTaxa, "species")) # summary of reads to check
AssignedToSpIdTax <- sample_sums(tax_glom(phyABC, "species")) # summary of reads to check
The lines below further format data for the final summary, including combining dataframes and including columns containing ASV and classification information.
ATSeqTabAndClassifiedInfo <- cbind(AssignTaxAllInfo, seqTable)
## add column with ASV #
ATSeqTabAndClassifiedInfo$ASV <- c("ASV")
ATSeqTabAndClassifiedInfo$ASV <- paste0("ASV_", seq.int(nrow(ATSeqTabAndClassifiedInfo)))
## getting header to include ASV and species name - adding column with both
ATSeqTabAndClassifiedInfo$ASVnumber <- paste(ATSeqTabAndClassifiedInfo$species, ATSeqTabAndClassifiedInfo$ASV, sep = "-")
Creating Fasta File of ASVs
This block of code creates a fasta file of the ASVs which will be used by BLAST.
##### Make FASTA file of ASVs #####
# Should be same regardless of IdTaxa or assigntaxonomy.
Below, sample names are extracted from file names. Ensure no two samples are named the same when specifying a pattern to split the file name by. The IdTaxa Classification Table explains this topic further.
samplenames <- sapply(strsplit(basename(filtFs), "_L001_R"), '[', 1)
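A quick sanity check, using made-up Illumina-style file names for illustration, that the split pattern yields a unique name per sample:

```r
# Hypothetical file names; swap in your own files and split pattern
files <- c("SampleA_S1_L001_R1_001.fastq.gz",
           "SampleB_S2_L001_R1_001.fastq.gz")
check <- sapply(strsplit(basename(files), "_L001_R"), '[', 1)
anyDuplicated(check) == 0  # TRUE when every sample name is unique
```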
The code below formats the ASV sequence table into a fasta file.
mname <- "taxid"
## Pull the ASV sequences and build a matching fasta header for each one
asv_seqs <- colnames(seqtab.nochim)
asv_headers <- paste0(">ASV_", seq_len(ncol(seqtab.nochim)))
asv_fasta <- c(rbind(asv_headers, asv_seqs))
The line below saves the fasta file.
write(asv_fasta, file = "ASV.fasta") # write() puts each element on its own line: header, sequence, header, ...
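The interleaving above relies on a small base-R trick worth spelling out: rbind() stacks the headers over the sequences, and c() then reads that matrix column by column, producing header, sequence, header, sequence. A toy illustration with made-up values:

```r
h <- c(">ASV_1", ">ASV_2")   # headers
s <- c("ACGT", "TTGA")       # sequences
c(rbind(h, s))               # ">ASV_1" "ACGT" ">ASV_2" "TTGA"
```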
BLAST
This is the final classification method of the pipeline. The ASVs are BLASTed through NCBI. There are important directories that must be changed here!!
# Same regardless of using IdTaxa or AssignTaxonomy
# Point R at the blastn executable
Change this path to tell R where the blastn executable is. If using the server, the path should be the same apart from the username.
blastn <- "/home/jill.derijke/miniconda3/bin/blastn" # server
The fasta file written earlier is now imported back into the pipeline to be used by BLAST. Change the path to wherever the file was written. If the fasta file name was changed when writing the file, it must be changed here too.
# Load the ASV fasta file to be blastn-ed. Ensure directory is correct
FAS <- "/home/jill.derijke/ASV.fasta" # server
This is the home directory, where the output file from BLAST will end up. It is important to know where these files are going!! Change the directory to where the file should go (often only the username needs changing).
# Set location of output file of blast results
homedir <- "/home/jill.derijke/" # server
The code below is the actual BLAST process. It is slow!! Note the file “BlastResults.out”. The name for this file can be changed, but must then be changed later on as well. It should be noted that the parameter max_target_seqs does not necessarily return the top 5 hits. A paper highlighting this issue and its impacts on bioinformatic analysis is linked in the Links tab.
# blastn the ASVs ; TAKES A WHILE!!! Loosened blastn parameters of what it should return as a hit (i.e. keep everything)
# The -outfmt field list is built first so the string contains no embedded
# newlines, which would break the shell command assembled by paste()
blast_outfmt <- paste("6 qseqid sseqid sscinames pident length mismatch gapopen qstart qend",
                      "sstart send evalue bitscore score qgi qacc qaccver qlen sallseqid sgi",
                      "sallgi sacc saccver sallacc slen nident gaps frames qframe sframe staxids",
                      "scomnames sblastnames sskingdoms stitle salltitles sstrand qcovs")
system2(blastn, paste("-db nt -query", FAS, "-out", paste0(homedir, "BlastResults.out"),
                      "-remote", "-max_target_seqs", 5, "-qcov_hsp_perc", 25, "-perc_identity", 0,
                      "-task blastn", "-outfmt", paste0("\"", blast_outfmt, "\"")))
# BLAST is run on the server. The above is what is used to do so (with modified inputs and outputs).
To use the BLAST results further, read the output file back into the pipeline with the line below. The path will need to be changed to point at your file.
# Import & look at BLAST results + the outputs
blast_asv <- read.table(file = "/home/jill.derijke/BlastResults.out",
sep = "\t", header = FALSE, stringsAsFactors = FALSE, fill = TRUE) # server
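If you prefer named columns over positional V-names, the vector below (a convenience added here, not part of the original pipeline) lists the 38 fields in the order given to -outfmt above, so you can look up what each V-column holds:

```r
# Field names in -outfmt order: V1 is blast_fields[1], V2 is blast_fields[2], etc.
blast_fields <- c("qseqid", "sseqid", "sscinames", "pident", "length", "mismatch",
                  "gapopen", "qstart", "qend", "sstart", "send", "evalue", "bitscore",
                  "score", "qgi", "qacc", "qaccver", "qlen", "sallseqid", "sgi", "sallgi",
                  "sacc", "saccver", "sallacc", "slen", "nident", "gaps", "frames",
                  "qframe", "sframe", "staxids", "scomnames", "sblastnames",
                  "sskingdoms", "stitle", "salltitles", "sstrand", "qcovs")
blast_fields[c(1, 2, 4, 13, 33)]  # the columns the pipeline selects next
```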
The lines below create a table format of the BLAST results that is compatible with the rest of the pipeline.
blast_asv_title <- select(blast_asv, "ASV" = V1, "BLAST seqid" = V2, "identity" = V4, "bitscore" = V13, "Assignment" = V33)
write.table(blast_asv_title, file = "BLASTResults.csv", sep = ",") # comma-separated to match the .csv extension
Formatting Data for Summary
The code within this section cleans up the data for the final excel workbook containing all significant classification information. At this stage in the pipeline, the hits from BLAST are listed in long format, meaning the 5 hits for one ASV sit one below the other. Left this way, the table has (number of BLAST hits) x (number of ASVs) rows, which easily reaches the 1000s, so the table is reorganized to list each ASV once with its BLAST hits laid out horizontally across the row.
The code below finds the unique ASVs (e.g. ASV_1 and ASV_2) and collapses their results. In this way, the dataframe that was
ASV_1 blast hit 1
ASV_1 blast hit 2
ASV_1 blast hit 3
is now
ASV_1 blast hit 1 blast hit 2 blast hit 3
# Number the hits 1..n within each unique ASV, then collapse
u <- unique(blast_asv_title[,1])
for (i in u) {
n <- sum(blast_asv_title$ASV == i) # number of hits for this ASV
blast_asv_title$n[blast_asv_title$ASV == i] <- 1:n
}
blast_asv_FINAL <- reshape(blast_asv_title, idvar = "ASV", timevar = "n", direction = "wide")
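To see what reshape() is doing here, a toy example with made-up data:

```r
# Long format: one row per (ASV, hit), with a within-ASV counter n
toy <- data.frame(ASV = c("ASV_1", "ASV_1", "ASV_2"),
                  hit = c("hitA", "hitB", "hitC"),
                  n   = c(1, 2, 1))
wide <- reshape(toy, idvar = "ASV", timevar = "n", direction = "wide")
wide  # one row per ASV, columns hit.1 and hit.2 (NA where an ASV has fewer hits)
```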
The line below takes the dataframe of classifications and combines it with the BLAST results by ASV, resulting in one large dataframe containing all classification information as well as BLAST hits corresponding to each ASV.
## Combining these two dataframes for classifications and BLAST
FinalSummary <- merge(SeqTabAndClassifiedInfo, blast_asv_FINAL, by = "ASV", all=TRUE)
The code below formats the object containing the IdTaxa confidence results into data that can go into an excel sheet, matching the information to its respective ASVs and samples so it is all traceable. The end result is a dataframe, organized by ASV, that presents the confidence percentage IdTaxa assigned at each taxonomic rank. In theory it is a simple organizational concept; in practice, a fair amount of code was needed to execute it properly.
##### Final summary of IdTaxa classification confidences #####
tryingC <- sapply(ids,
function(x)
paste(x$confidence,
collapse=";"))
tryingT <- sapply(ids,
function(x)
paste(x$taxon,
collapse=";"))
Tdf <- as.data.frame(tryingT)
Cdf<- as.data.frame(tryingC)
## Making columns
library(stringr) # provides str_split_fixed
Cdf<- str_split_fixed(Cdf$tryingC, ";", 9) ## works
colnames(Cdf) <- c("root","domain", "kingdom", "phylum", "class", "order","family", "genus", "species")
rownames(Cdf) <- SeqTabAndClassifiedInfo$ASVnumber
Tdf <- str_split_fixed(Tdf$tryingT, ";", 9) ## works
colnames(Tdf) <- c("root","domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
rownames(Tdf) <- SeqTabAndClassifiedInfo$ASVnumber
TDF <- as.data.frame(Tdf)
CDF <- as.data.frame(Cdf)
TestDF <- TDF
TestDF$ASV <- paste(SeqTabAndClassifiedInfo$ASV) ## changed
## combining tax and confidence columns, one combined column per rank
ranks <- c("root", "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
for (r in ranks) {
TestDF[[paste0(r, "2")]] <- paste(TestDF[[r]], CDF[[r]], sep = "-")
}
FinalConfSum <- select(TestDF,-c(1:9))
colnames(FinalConfSum) <- c("ASV", "root","domain", "kingdom", "phylum", "class", "order", "family", "genus", "species")
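The paste-then-split pattern used above is easier to see on a toy example (made-up confidence values): one ASV's confidences are collapsed into a single ";"-separated string, then split back out into one column per taxonomic rank.

```r
library(stringr)

# Collapse one ASV's per-rank confidences into a single string
collapsed <- paste(c("99.1", "98.5", "80.2"), collapse = ";")
# Split back into one column per rank
conf_mat <- str_split_fixed(collapsed, ";", 3)  # 1 x 3 character matrix
```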
The code below gathers all read-count information for the final summary. It draws on three dataframes: one tracking read totals after each major filtering step in the pipeline, and one each counting the reads classified to the species level by AssignTaxonomy and IdTaxa. The two species-level counts are merged into a single dataframe; the read-tracking table is formatted separately.
# summary tracking reads through the pipeline
tttt <- as.data.frame(track)
tttt1 <- tibble::rownames_to_column(tttt, var="Samples")
atat1 <- as.data.frame(AssignedToSpAT)
atat2 <- tibble::rownames_to_column(atat1, var="Samples")
idid <- as.data.frame(AssignedToSpIdTax)
idid1 <- tibble::rownames_to_column(idid, var="Samples")
TrackReadsClassif <- merge(atat2, idid1, all=TRUE, by = "Samples" )
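A tiny made-up example of what all = TRUE does in the merge above: samples present in either table are kept, with NA filling the gaps.

```r
a <- data.frame(Samples = c("S1", "S2"), AT = c(100, 200))
b <- data.frame(Samples = c("S2", "S3"), Id = c(150, 50))
merged <- merge(a, b, all = TRUE, by = "Samples")
merged  # 3 rows; NA where a sample is missing from one table
```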
The lines below keep the formatting for the final summary excel sheet consistent, and add a column of ASV numbers at the start of each dataframe.
## including the ASV as first row of sheet
ATSeqTabAndClassifiedInfo2 <- cbind(SeqTabAndClassifiedInfo$ASV, ATSeqTabAndClassifiedInfo)
SeqTabAndClassifiedInfo2 <- cbind(SeqTabAndClassifiedInfo$ASV, SeqTabAndClassifiedInfo)
The code below creates a dataframe with all species classifications collapsed, rather than listed by ASV. This makes it easy to briefly review the species composition of all samples.
phyTEST <- tax_glom(phyABC, "species", NArm = FALSE) # collapses them properly
classifTEST <- psmelt(phyTEST) #keep - converts into dataframe
colnames(classifTEST)[1]<- c("ASV")
classifTEST$ASV <- NULL
classifTEST$Sample <- NULL
require(reshape2)
newdfTEST <- reshape(classifTEST, timevar = "Deer", idvar = c( "domain", "kingdom", "phylum", "class", "order", "family", "genus", "species"), direction = "wide") # "Deer" is the sample metadata column for this dataset; change it to match yours
names(newdfTEST) <- gsub(pattern = "Abundance.", replacement = "", x = names(newdfTEST), ignore.case = TRUE) #removing part of sample name that is not needed
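The gsub() call above strips the "Abundance." prefix that reshape() adds to each sample column. For example, with made-up sample names:

```r
nms <- c("Abundance.DeerA", "Abundance.DeerB")
cleaned <- gsub("Abundance.", "", nms, ignore.case = TRUE)
cleaned  # "DeerA" "DeerB"
```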
Summarizing into Excel Workbook
This is the final step in the pipeline!! Now that all information is properly formatted, the code below converts it all into multiple excel sheets within one excel workbook for easy viewing! Each code section from here on is a sheet in the excel workbook.
wb <- createWorkbook() ## Create the workbook (it is saved at the very end)
The sheet below contains the IdTaxa classification results collapsed by species.
# sheet of collapsed IdTaxa classifications (by species)
IdTaxCollapsedSheet <- addWorksheet(wb, sheetName = "CollapsedSpeciesIdTaxa", tabColour = "#800000")
writeData(wb, IdTaxCollapsedSheet, newdfTEST, colNames = TRUE, rowNames = FALSE)
The sheet below contains IdTaxa classifications by ASV.
# sheet of IdTaxa classifications
IdTaxSheet <- addWorksheet(wb, sheetName = "IdTaxaClassification", tabColour = "#CA0000")
writeData(wb, IdTaxSheet, SeqTabAndClassifiedInfo2, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#CA0000" )
addStyle(wb, "IdTaxaClassification", cols = 1, rows=1:(nrow(SeqTabAndClassifiedInfo2)+1), style = colForcol)
The sheet below contains the confidence values of each taxonomy rank by ASV.
# IdTaxa classification confidence values sheet
IdTaxConfidenceSheet <- addWorksheet(wb, sheetName = "IdTaxaConfidence", tabColour = "#FF0000")
writeData(wb, IdTaxConfidenceSheet, FinalConfSum, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF0000" )
addStyle(wb, "IdTaxaConfidence", cols = 1, rows=1:(nrow(FinalConfSum)+1), style = colForcol)
This sheet contains all BLAST results by ASV, with the BLAST hits laid across each row.
# sheet of BLAST results (5 hits across a row)
blastSheet <- addWorksheet(wb, sheetName = "BLAST", tabColour = "#FF2C2C")
writeData(wb, blastSheet, blast_asv_FINAL, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF2C2C" )
addStyle(wb, "BLAST", cols = 1, rows=1:(nrow(blast_asv_FINAL)+1), style = colForcol)
The sheet below combines IdTaxa and BLAST classification results into one large dataframe if that is preferred for easier graphing/viewing.
# sheet with IdTaxa and BLAST results combined
AllSheet <- addWorksheet(wb, sheetName = "All", tabColour = "#FF5C5C")
writeData(wb, AllSheet, FinalSummary, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF5C5C" )
addStyle(wb, "All", cols = 1, rows=1:(nrow(FinalSummary)+1), style = colForcol)
The sheet below includes all AssignTaxonomy classification results by ASV.
# sheet with assigntaxonomy results
ATSheet <- addWorksheet(wb, sheetName = "AssignTaxonomy", tabColour = "#FF8484")
writeData(wb, ATSheet, ATSeqTabAndClassifiedInfo2, colNames = TRUE, rowNames = FALSE)
colForcol <- createStyle(fgFill = "#FF8484")
addStyle(wb, "AssignTaxonomy", cols = 1, rows=1:(nrow(ATSeqTabAndClassifiedInfo2)+1), style = colForcol)
This sheet tracks reads through the pipeline by sample.
## sheet tracking reads through the pipeline
TrackSheet <- addWorksheet(wb, sheetName = "TrackReads", tabColour = "#FFB3B3")
writeData(wb, TrackSheet, tttt1, colNames = TRUE, rowNames = FALSE)
The last sheet below contains how many reads were classified to the species level by AssignTaxonomy and IdTaxa respectively. AssignTaxonomy tends to have far fewer reads assigned to the species level.
## sheet with how many reads are classified to the species level
NumberReadsClassifiedSheet <- addWorksheet(wb, sheetName = "#ReadsClassified",tabColour = "#FFCCCC")
writeData(wb, NumberReadsClassifiedSheet, TrackReadsClassif, colNames = TRUE, rowNames = FALSE)
InfoSheet <- addWorksheet(wb, sheetName = "Info") # blank sheet, left empty for notes
The line below saves the workbook, completing the pipeline! The final excel file can easily be renamed within the quotes if wanted; keep the .xlsx file extension, however.
saveWorkbook(wb, file = "FinalSummary.xlsx", overwrite = TRUE)
:)